Exploring YouTube Video Trends and Predictive Insights¶
Author: Alinzon Simon
Course: Adv. Programming Techniques
School: School of Professional Studies
Date: May 2025
Abstract¶
The YouTube recommendation engine significantly influences content visibility and user engagement, contributing to a dynamic, data-driven media landscape. Analyzing trending videos provides valuable insights into user engagement patterns across different regions. This study investigates the popularity of video categories among the top 50 trending videos in multiple countries, examining how regional preferences shape content trends. Additionally, it explores the correlation between video categories and user engagement metrics, specifically the number of likes and views a video receives. To further deepen the engagement analysis, this study evaluates the feasibility of predicting video popularity, measured by likes and views, based on features such as description content, category, and region. By leveraging machine learning techniques, the research aims to develop predictive models that identify key drivers of virality, incorporating statistical analysis and natural language processing (NLP). Finally, sentiment analysis is applied to video descriptions to evaluate how emotional tone, particularly expressions of joy and happiness, influences user interaction. By mapping sentiment to engagement metrics like likes, comments, and views, this research aims to uncover whether emotionally charged descriptions lead to higher engagement levels.The findings from this study contribute to a broader understanding of digital media trends and user behavior, offering insights for content creators, marketers, and platform developers seeking to optimize engagement strategies.
Introduction¶
With the rise of digital content consumption, YouTube has become a dominant platform shaping media trends across the globe. The dynamics of trending videos offer valuable insights into user preferences and engagement patterns, influenced by factors such as video category, regional audience behavior, and emotional sentiment. First, we investigate the most popular YouTube video categories among the top 50 trending videos in different countries, analyzing how regional preferences influence content trends. Second, we assess the correlation between video categories and the number of likes they receive, identifying patterns in audience engagement. Additionally, we explore the feasibility of predicting video popularity—measured through likes and views based on attributes like description content, category, and geographic region. Using machine learning techniques, this study seeks to identify key factors driving engagement. Finally, we examine the role of emotions and sentiments embedded within video descriptions to determine their impact on user interactions, specifically whether positive emotional tones like joy and happiness contribute to higher levels of likes and comments.
Project Overview¶
Objective¶
This project explores the dynamics of YouTube’s trending video ecosystem by analyzing the top 50 trending videos across multiple countries. The goal is to understand content trends, user engagement patterns, and the emotional impact of video descriptions. Through statistical analysis and machine learning, this study provides actionable insights into what drives video popularity and viewer interaction.
Its primary objectives include:
- Understanding regional preferences in content categories
- Examining user engagement patterns (views, likes, comments)
- Developing predictive models for video popularity
- Assessing the emotional tone of video descriptions and its impact on engagement
Dataset Description¶
- Video Metadata: title, description, publishedat, video_id, channel_title
- Categorical: category_name, region_name, livebroadcastcontent
- Engagement Metrics: view_count, like_count, comment_count, favoritecount
Research Questions¶
What are the most popular YouTube video categories across different countries in the top 50 trending videos, and how do regional preferences vary for these trending videos?
Hypothesis: The popularity of video categories significantly differs across regions, with some categories (eg. music, entertainment) being universally popular and others (eg. news, sports) reflecting regional preferences.
How does the category of a YouTube video correlate with the number of likes it receives?
Hypothesis: Certain categories, such as entertainment and music, receive consistently higher like counts than informational categories.
Can we predict the number of likes or views a YouTube video will receive based on its description, category, or region?
Hypothesis: Video metadata (category, region) and description content can be used to build models that effectively predict likes and views, with region and category being strong predictors.
How do emotions and sentiments in video descriptions relate to engagement metrics, such as likes, comments, and views? Specifically, do descriptions with emotional tones like joy and happiness result in a higher number of likes and comments?
Hypothesis: Descriptions expressing positive emotions (e.g., joy, excitement) correlate with higher user engagement, especially in the form of likes and comments.
The credentials were saved in a .env file for security reasons and imported into the session to begin collecting the data. For this step a function called load_env_variables was created to load the environment variables from .env file.
def load_env_variables():
from dotenv import load_dotenv
import os
load_dotenv()
env_variables= {}
env_variables["API"] = os.getenv("API_KEY")
env_variables["ENV"] = os.getenv("ENVIRONMENT")
env_variables["MongoClient"] = os.getenv("MONGODB_URL")
env_variables["DB_NAME"] = os.getenv("DB_NAME")
return env_variables
I'm planning to call the API and save the data into MongoDB so I will define a function.
def connect_to_mongo(mongo_url,db_name, collection_name):
from pymongo import MongoClient
client = MongoClient(mongo_url)
db = client[db_name]
collection = db[collection_name]
return collection
Now, I will create three functions. One to get the available categories per region, another one to get the full name of the regions and the last one to get the trending videos.
def get_video_category(api_key,region_code='US'):
import requests
base_url = 'https://www.googleapis.com/youtube/v3/videoCategories'
url = f'{base_url}?part=snippet®ionCode={region_code}&key={api_key}'
response = requests.get(url)
if response.status_code == 200:
categories = response.json().get('items', [])
category_dict = {
cat['id']: {
'id': cat['id'],
'title': cat['snippet']['title'],
'region_code': region_code
}
for cat in categories
}
return category_dict
def get_available_regions(api_key):
import requests
base_url = 'https://www.googleapis.com/youtube/v3/i18nRegions'
url = f'{base_url}?part=snippet&key={api_key}'
response = requests.get(url)
if response.status_code == 200:
regions = []
all_regions = response.json().get('items', [])
for item in all_regions:
regions_details = {
'id': item['id'],
'name': item['snippet']['name'],
'gl': item['snippet']['gl'],
}
regions.append(regions_details)
return regions
return []
def get_youtube_trending_videos(api_key, region_code='US', max_results=10):
from googleapiclient.discovery import build
try:
youtube = build('youtube', 'v3', developerKey=api_key)
request = youtube.videos().list(
part='snippet,statistics',
chart='mostPopular',
regionCode=region_code,
maxResults=max_results
)
response = request.execute()
videos = []
for item in response.get('items', []):
video_details = {
'title': item['snippet']['title'],
'publishedAt': item['snippet']['publishedAt'],
'description': item['snippet']['description'],
'video_id': item['id'],
'channel_title': item['snippet']['channelTitle'],
'categoryId': item['snippet']['categoryId'],
'liveBroadcastContent': item['snippet']['liveBroadcastContent'],
'view_count': item['statistics'].get('viewCount', 'N/A'),
'like_count': item['statistics'].get('likeCount', 'N/A'),
'comment_count': item['statistics'].get('commentCount', 'N/A'),
'favoriteCount': item['statistics'].get('favoriteCount', 'N/A'),
'thumbnail': item['snippet']['thumbnails']['default']['url'],
'region_code': region_code,
}
videos.append(video_details)
return videos
except Exception as e:
print(f"An error occurred: {e}")
return None
In the next step, I will call all the functions I created so I can import the data into MongoDB. Three different Collections were created in MongoDB.
- Youtube Trending Videos
api_key = load_env_variables().get('API')
env = load_env_variables().get('ENV')
MongoClient = load_env_variables().get('MongoClient')
db_name = 'YouTubeTrending'
collection_name = 'Videos'
client = connect_to_mongo(MongoClient, db_name, collection_name)
client.create_index([('video_id', ASCENDING)], unique=True)
trending_videos_response = get_youtube_trending_videos(api_key, max_results=1000)
client.insert_many(trending_videos_response, ordered=False)
- Videos Regions
collection_name3 = 'Videos_Regions'
client3 = connect_to_mongo(MongoClient, db_name, collection_name3)
regions_list = get_available_regions(api_key)
client3.insert_many(regions_list)
- Videos Categories
This function requires a region code as an input, so I created a CSV file containing that information for later import into a DataFrame. Using a for loop, the function was executed for each value in the dataset.
collection_name2 = 'Videos_Categories'
client2 = connect_to_mongo(MongoClient, db_name, collection_name2)
file_path = 'data/YouTubeTrending.Videos_Regions.csv'
csv_data = load_csv_file(file_path)
for(index, row) in csv_data.iterrows():
region_code =row['id']
print(f"Current Region: {region_code}")
category_dict = get_video_category(api_key, region_code=region_code)
if category_dict:
client2.insert_many(list(category_dict.values()))
else:
print(f"No categories found for region: {region_code}")
2. Data Cleansing¶
The goal of this step is to improve data quality, enabling better decision-making and increasing the efficiency of data analysis.
2.1. Load the Data¶
To explore the data, I created a Python class called DBConnector, which contains predefined methods that I will use to insert, extract, and update records in MongoDB. As mentioned previously, the data was extracted by calling the YouTube API, loaded into MongoDB to avoid multiple API calls, and then retrieved into a dataframe for further exploration.
class DBConnector:
def __init__(self, mongo_url, db_name, collection_name):
from pymongo import MongoClient
self.client = MongoClient(mongo_url)
self.db = self.client[db_name]
self.collection = self.db[collection_name]
def insert_one(self, data):
return self.collection.insert_one(data)
def insert_many(self, data):
return self.collection.insert_many(data)
def find(self, query):
return self.collection.find(query)
def find_all(self):
return self.collection.find()
def update_one(self, query, update):
return self.collection.update_one(query, update)
def delete_one(self, query):
return self.collection.delete_one(query)
Now, I will import the credentials, followed by importing Pandas to begin working with the DataFrames.
env_variables = load_env_variables()
mongo_url = env_variables["MongoClient"]
db_name = env_variables["DB_NAME"]
collection_videos = "Videos"
collection_categories = "Videos_Categories"
collection_regions = "Videos_Regions"
import pandas as pd
2.1.1 Importing Videos data from MongoDB¶
Videos collection from MongoDB will be imported into a list, then transformed into a dataframe for data cleansing.
db_connector = DBConnector(mongo_url, db_name, collection_videos)
videos =list(db_connector.find_all())
df_videos = pd.DataFrame(videos)
print(f"Total records in {collection_videos} collection: {len(df_videos)}")
Total records in Videos collection: 5496
print( df_videos.iloc[:, [1,2,5,7,8,9,10,13]].head(5))
title publishedAt \
0 Alappuzha Gymkhana - Trailer | Khalid Rahman |... 2025-03-26T04:30:01Z
1 ARGENTINA vs. BRASIL [4-1] | RESUMEN | ELIMINA... 2025-03-26T02:57:42Z
2 فاجأت ابوي بسيارته قبل 30 سنة ! 2025-03-25T22:26:40Z
3 SIKANDAR Official Trailer - Salman Khan, Rashm... 2025-03-23T11:45:34Z
4 🚨 مباشر - كأس عاصمة مصر: مباراة الأهلي ضد طلائ... 2025-03-24T21:42:49Z
channel_title liveBroadcastContent view_count like_count \
0 Think Music India none 2928449 88414
1 CONMEBOL none 13077910 N/A
2 Ayman Az l ايمن none 479988 31194
3 NadiadwalaGrandson none 55682056 896040
4 kora plus none 936183 6772
comment_count region_code
0 2385 AE
1 13041 AE
2 908 AE
3 74621 AE
4 51 AE
2.1.2 Import Category data from MongoDB¶
Videos_Categories collection from MongoDB will be imported into a list, then transformed into a dataframe for data cleansing.
db_connector2 = DBConnector(mongo_url, db_name, collection_categories)
categories = list(db_connector2.find_all())
df_categories = pd.DataFrame(categories)
print(f"Total records in {collection_categories} collection: {len(df_categories)}")
Total records in Videos_Categories collection: 3411
print(df_categories.iloc[:, [1,2,3]].head(5))
id title region_code 0 1 Film & Animation AE 1 2 Autos & Vehicles AE 2 10 Music AE 3 15 Pets & Animals AE 4 17 Sports AE
2.1.3 Import Region data from MongoDB¶
Videos_Regions collection from MongoDB will be imported into a list, then transformed into a dataframe for data cleansing.
db_connector3 = DBConnector(mongo_url, db_name, collection_regions)
regions = list(db_connector3.find_all())
df_regions = pd.DataFrame(regions)
print(f"Total records in {collection_regions} collection: {len(df_regions)}")
print(df_regions.iloc[:, :].head(5))
Total records in Videos_Regions collection: 110
_id id name gl
0 67e35d0970bd48b9b7e4442b AE United Arab Emirates AE
1 67e35d0970bd48b9b7e4442c BH Bahrain BH
2 67e35d0970bd48b9b7e4442d DZ Algeria DZ
3 67e35d0970bd48b9b7e4442e EG Egypt EG
4 67e35d0970bd48b9b7e4442f IQ Iraq IQ
Now that the data is loaded in some data frames I will proceed to generate the plots
2.2. Understand the Data¶
Before manipulating the data, it's essential to understand the structure and columns we'll be working with. In this case, we have three DataFrames: df_videos, df_categories, and df_regions.
2.2.1 df_videos DataFrame¶
In order to understand the structure of this DataFrame we will use info() method
print(df_videos.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5496 entries, 0 to 5495 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 _id 5496 non-null object 1 title 5496 non-null object 2 publishedAt 5496 non-null object 3 description 5496 non-null object 4 video_id 5496 non-null object 5 channel_title 5496 non-null object 6 categoryId 5496 non-null object 7 liveBroadcastContent 5496 non-null object 8 view_count 5496 non-null object 9 like_count 5496 non-null object 10 comment_count 5496 non-null object 11 favoriteCount 5496 non-null object 12 thumbnail 5496 non-null object 13 region_code 5496 non-null object dtypes: object(14) memory usage: 601.3+ KB None
As a result of the previous step, we identified several columns with incorrect data types: categoryID, view_count, like_count, comment_count, and favoriteCount.
print(df_videos.head())
_id \
0 67e47e6bdd90a0f80b10a53d
1 67e47e6bdd90a0f80b10a53e
2 67e47e6bdd90a0f80b10a53f
3 67e47e6bdd90a0f80b10a540
4 67e47e6bdd90a0f80b10a541
title publishedAt \
0 Alappuzha Gymkhana - Trailer | Khalid Rahman |... 2025-03-26T04:30:01Z
1 ARGENTINA vs. BRASIL [4-1] | RESUMEN | ELIMINA... 2025-03-26T02:57:42Z
2 فاجأت ابوي بسيارته قبل 30 سنة ! 2025-03-25T22:26:40Z
3 SIKANDAR Official Trailer - Salman Khan, Rashm... 2025-03-23T11:45:34Z
4 🚨 مباشر - كأس عاصمة مصر: مباراة الأهلي ضد طلائ... 2025-03-24T21:42:49Z
description video_id \
0 #AlappuzhaGymkhana #Naslen #LukmanAvaran #Gana... acCVmR5RrN0
1 ¡Goleada de #Argentina en el clásico! 👏🇦🇷\n\n#... s0ab0CdGWMo
2 انا ايمن الي يحب الكومنتات عطني رايك بالمقطع و... n9wHx1aFbzI
3 Presenting the Official Trailer of Sikandar! 🔥... BAk5ZCoTWY8
4 🚨 مباشر - كأس عاصمة مصر: مباراة الأهلي ضد طلائ... DJxK5tHOpis
channel_title categoryId liveBroadcastContent view_count like_count \
0 Think Music India 1 none 2928449 88414
1 CONMEBOL 17 none 13077910 N/A
2 Ayman Az l ايمن 24 none 479988 31194
3 NadiadwalaGrandson 24 none 55682056 896040
4 kora plus 17 none 936183 6772
comment_count favoriteCount thumbnail \
0 2385 0 https://i.ytimg.com/vi/acCVmR5RrN0/default.jpg
1 13041 0 https://i.ytimg.com/vi/s0ab0CdGWMo/default.jpg
2 908 0 https://i.ytimg.com/vi/n9wHx1aFbzI/default.jpg
3 74621 0 https://i.ytimg.com/vi/BAk5ZCoTWY8/default.jpg
4 51 0 https://i.ytimg.com/vi/DJxK5tHOpis/default.jpg
region_code
0 AE
1 AE
2 AE
3 AE
4 AE
2.2.2 df_categories DataFrame¶
For this DataFrame we will also use info().
print(df_categories.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3411 entries, 0 to 3410 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 _id 3411 non-null object 1 id 3411 non-null object 2 title 3411 non-null object 3 region_code 3411 non-null object dtypes: object(4) memory usage: 106.7+ KB None
We can identify incorrect data types for the id column, which is especially important if we intend to use this column for join operations.
print(df_categories.head())
_id id title region_code 0 67e47f88f67e9d2ee6631545 1 Film & Animation AE 1 67e47f88f67e9d2ee6631546 2 Autos & Vehicles AE 2 67e47f88f67e9d2ee6631547 10 Music AE 3 67e47f88f67e9d2ee6631548 15 Pets & Animals AE 4 67e47f88f67e9d2ee6631549 17 Sports AE
2.2.2 df_regions DataFrame¶
Same as the other DataFrames we will also use info().
print(df_regions.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 110 entries, 0 to 109 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 _id 110 non-null object 1 id 110 non-null object 2 name 110 non-null object 3 gl 110 non-null object dtypes: object(4) memory usage: 3.6+ KB None
In this case, we will update 'id' to 'region_code' to ensure name consistency across all data frames.
print(df_regions.head())
_id id name gl 0 67e35d0970bd48b9b7e4442b AE United Arab Emirates AE 1 67e35d0970bd48b9b7e4442c BH Bahrain BH 2 67e35d0970bd48b9b7e4442d DZ Algeria DZ 3 67e35d0970bd48b9b7e4442e EG Egypt EG 4 67e35d0970bd48b9b7e4442f IQ Iraq IQ
2.3. Fix Data Types and column Names¶
df_videos
We will fix the data types and column names for categoryID, view_count, like_count, comment_count, and favoriteCount. For the count columns any NA value will be replaced with 0.
df_videos.rename(columns={'categoryId': 'category_id'}, inplace=True)
df_videos['category_id'] = df_videos['category_id'].astype(int)
df_videos['view_count'] = pd.to_numeric(df_videos['view_count'], errors='coerce')
df_videos['view_count'] = df_videos['view_count'].fillna(0).astype(int)
df_videos['like_count'] = pd.to_numeric(df_videos['like_count'], errors='coerce')
df_videos['like_count'] = df_videos['like_count'].fillna(0).astype(int)
df_videos['comment_count'] = pd.to_numeric(df_videos['comment_count'], errors='coerce')
df_videos['comment_count'] = df_videos['comment_count'].fillna(0).astype(int)
df_videos['favoriteCount'] = pd.to_numeric(df_videos['favoriteCount'], errors='coerce')
df_videos['favoriteCount'] = df_videos['favoriteCount'].fillna(0).astype(int)
print(df_videos.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5496 entries, 0 to 5495 Data columns (total 14 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 _id 5496 non-null object 1 title 5496 non-null object 2 publishedAt 5496 non-null object 3 description 5496 non-null object 4 video_id 5496 non-null object 5 channel_title 5496 non-null object 6 category_id 5496 non-null int64 7 liveBroadcastContent 5496 non-null object 8 view_count 5496 non-null int64 9 like_count 5496 non-null int64 10 comment_count 5496 non-null int64 11 favoriteCount 5496 non-null int64 12 thumbnail 5496 non-null object 13 region_code 5496 non-null object dtypes: int64(5), object(9) memory usage: 601.3+ KB None
df_categories
We will fix the data types and column names for id.
df_categories.rename(columns={'id': 'category_id'}, inplace=True)
df_categories['category_id'] = df_videos['category_id'].astype(int)
print(df_categories.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3411 entries, 0 to 3410 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 _id 3411 non-null object 1 category_id 3411 non-null int64 2 title 3411 non-null object 3 region_code 3411 non-null object dtypes: int64(1), object(3) memory usage: 106.7+ KB None
- df_regions
Column rename for id to region_code and name to region_name
df_regions.rename(columns={'id': 'region_code'}, inplace=True)
df_regions.rename(columns={'name': 'region_name'}, inplace=True)
print(df_regions.info())
<class 'pandas.core.frame.DataFrame'> RangeIndex: 110 entries, 0 to 109 Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 _id 110 non-null object 1 region_code 110 non-null object 2 region_name 110 non-null object 3 gl 110 non-null object dtypes: object(4) memory usage: 3.6+ KB None
For df_videos, df_categories and df_regions, I will use lowercase for all the columns and replace space with underscore.
df_videos.columns = df_videos.columns.str.lower().str.replace(' ', '_')
df_categories.columns = df_categories.columns.str.lower().str.replace(' ', '_')
df_regions.columns = df_regions.columns.str.lower().str.replace(' ', '_')
2.4. Remove Missing values and duplicate rows¶
2.4.1 Remove rows with null values¶
df_videos = df_videos.dropna(how='all')
df_categories = df_categories.dropna(how='all')
df_regions = df_regions.dropna(how='all')
2.4.1 Remove duplicate rows¶
df_videos = df_videos.drop_duplicates()
df_categories = df_categories.drop_duplicates()
df_regions = df_regions.drop_duplicates()
Now I will generate a final DataFrame with the clean data.
UT_Videos_df = pd.merge(df_videos, df_categories[['category_id','region_code','title']], on=['category_id','region_code'],how="left",suffixes=('', '_category'))
UT_Videos_df = UT_Videos_df.rename(columns={'title_category': 'category_name'})
UT_Videos_df = pd.merge(UT_Videos_df, df_regions[['region_code','region_name']], on=['region_code'],how="left")
print(UT_Videos_df.columns)
Index(['_id', 'title', 'publishedat', 'description', 'video_id',
'channel_title', 'category_id', 'livebroadcastcontent', 'view_count',
'like_count', 'comment_count', 'favoritecount', 'thumbnail',
'region_code', 'category_name', 'region_name'],
dtype='object')
3. Exploratory Data Analysis¶
Top 10 Trending Categories by Total View Count¶
This chart shows the top 10 YouTube video categories based on total views. From the bars, we can see that the Comedy category has the highest total views across all regions, reaching approximately 32.77 billion views. It is followed closely by Shorts videos. Interestingly, the News & Politics category also appears in the top 10, highlighting significant viewer interest in current events.
import plotly.express as px
top_categories = (
UT_Videos_df.groupby('category_name')['view_count']
.sum()
.sort_values(ascending=False)
.head(10)
.reset_index()
)
fig = px.bar(
top_categories,
x='category_name',
y='view_count',
title='Top 10 Trending Categories by Total View Count',
labels={'category_name': 'Category', 'view_count': 'Total Views'},
color_discrete_sequence=['#FF0000'] # YouTube Red
)
fig.update_layout(
plot_bgcolor='#FFFFFF',
paper_bgcolor='#FFFFFF',
font=dict(color='#282828') # YouTube Dark Gray
)
fig.show()
Top 10 Trending Categories by View Count, Like Count and Comment Count¶
This grouped bar chart compares the top 10 most viewed YouTube video categories based on three key metrics: Total views (red), Total Likes (Dark Gray) and Total Comments (Rosy Brown). Comedy is the dominant category, with over 32.7 billion views, making it the most-watched and one of the most liked and commented categories.
import plotly.graph_objects as go
category_summary = (
UT_Videos_df.groupby('category_name')[['view_count', 'like_count','comment_count']]
.sum()
.sort_values(by='view_count', ascending=False)
.head(10)
.reset_index()
)
fig = go.Figure()
fig.add_trace(go.Bar(
x=category_summary['category_name'],
y=category_summary['view_count'],
name='Total Views',
marker_color='#FF0000' ,
text=category_summary['view_count'],
textposition='outside'
))
fig.add_trace(go.Bar(
x=category_summary['category_name'],
y=category_summary['like_count'],
name='Total Likes',
marker_color='#282828',
text=category_summary['like_count'],
textposition='outside'
))
fig.add_trace(go.Bar(
x=category_summary['category_name'],
y=category_summary['comment_count'],
name='Total Comments',
marker_color='rosybrown' ,
text=category_summary['comment_count'],
textposition='outside'
))
fig.update_layout(
title='Top 10 Trending Categories by View Count, Like Count and Comment Count',
xaxis_title='Category',
yaxis_title='Count',
barmode='group',
plot_bgcolor='#FFFFFF',
paper_bgcolor='#FFFFFF',
font=dict(color='#282828'),
legend=dict(title=None)
)
fig.show()
Total Views by Region¶
This horizontal bar chart displays the total number of YouTube views per region, sorted in descending order. Each bar represents one region, with the bar length corresponding to the total view count.
- Bangladesh leads all regions with the highest total view count, with more than 15 billion views.
- Other top performing regions include Malta, Pakistan, Malaysia, and Zimbabwe, each generating over 10 billion views.
import plotly.express as px
top_regions = (
UT_Videos_df.groupby('region_name')['view_count']
.sum()
.sort_values(ascending=False)
.reset_index()
)
fig = px.bar(
top_regions,
x='region_name',
y='view_count',
title='Total Views by Region',
labels={'region_name': 'Region', 'view_count': 'Total Views'},
color_discrete_sequence=['#FF0000']
)
fig.update_layout(
plot_bgcolor='#FFFFFF',
paper_bgcolor='#FFFFFF',
font=dict(color='#282828')
)
fig.show()
Top 10 Regions by View Count (Stacked by Category)¶
This stacked bar chart displays the total YouTube video view counts for the top 10 regions, segmented by the top 10 video categories. Each bar represents a region, and the height corresponds to the total number of views. The bars are colored by category, allowing for a visual comparison of categories performance within each region.
Malaysia has the highest total views among all regions shown, with strong contributions from Shorts, Comedy, Music, and Trailers categories.
Bangladesh follows closely, where Family, Gaming, and Documentary dominate the view count.
Comedy and Shorts consistently appear across multiple regions, indicating broad global appeal.
The presence of News & Politics in multiple regions, including Bangladesh, Slovenia, Vietnam, Malaysia, Pakistan, and Israel, highlights the regional interest in current events.
import plotly.express as px
region_category_views = ( UT_Videos_df.groupby(['region_name', 'category_name'])['view_count']
.sum()
.reset_index()
)
region_totals = (
region_category_views.groupby('region_name')['view_count']
.sum()
.sort_values(ascending=False)
)
top_10_regions = region_totals.head(10).index.tolist()
category_totals = (
region_category_views.groupby('category_name')['view_count']
.sum()
.sort_values(ascending=False)
)
top_categories = category_totals.head(10).index.tolist()
filtered_data = region_category_views[
(region_category_views['region_name'].isin(top_10_regions)) &
(region_category_views['category_name'].isin(top_categories))
]
fig = px.bar(
filtered_data,
x='region_name',
y='view_count',
color='category_name',
barmode='stack',
title='Top 10 Regions by View Count (Top 10 Categories Only)',
labels={
'region_name': 'Region',
'view_count': 'Total Views',
'category_name': 'Category'
}
)
fig.update_layout(
plot_bgcolor='#FFFFFF',
paper_bgcolor='#FFFFFF',
font=dict(color='#282828'),
legend_title_text='Category'
)
fig.show()
Views vs Likes Colored by Category¶
This scatter plot illustrates the relationship between likes and views across YouTube videos, with each dot representing a video. The dots are colored by video category. There is a clear positive correlation between likes and views videos with higher likes generally have more views. The distribution of categories across the chart suggests that engagement is not confined to a single type of content.
fig = px.scatter(
UT_Videos_df,
x='like_count',
y='view_count',
color='category_name',
title='Views vs Likes Colored by Category',
labels={'like_count': 'Likes', 'view_count': 'Views', 'category_name': 'Category'},
hover_data=['title', 'channel_title']
)
fig.show()
Data Analysis¶
This section explores YouTube trending video data, leveraging visual exploration, correlation analysis, and feature-driven insights to investigate key research questions. It incorporates univariate and multivariate analyses to reveal trends in regional preferences, engagement dynamics, and the emotional tone of video descriptions.
Q1: Regional Preferences in Video Categories¶
Objective: Identify the most popular video categories in each region based on likes and views.
The popularity of video categories significantly differs across regions, with some categories being universally popular and others reflecting regional preferences.
Top Video Categories by Region - View Count¶
Globally dominant Categories
- Action/Adventure
Leads in 21 countries including: India, United Kingdom, Mexico, Brazil, Greece, Morocco, and Thailand This indicates widespread appeal across Europe, South Asia, Latin America, and North Africa.
- Anime/Animation
Tops the chart in 16 countries: Chile, Cambodia, Saudi Arabia, Sri Lanka, Venezuela High popularity in Asia, Middle East, and Latin America.
- Autos & Vehicles
Most viewed in 8 countries:France, Japan, Kenya, Czechia Appealing in Europe, Africa, and Asia.
- Entertainment
Most popular in: Azerbaijan, Egypt, North Macedonia, Peru, Turkey, Zimbabwe Maintains broad appeal, particularly in Middle East and South America.
Regional Popular
- Film & Animation
Only top in Bangladesh and Vietnam. This suggests localized cultural engagement.
- Drama
Each dominates in just a few regions. Estonia, Hungary, Indonesia, Ireland, Laos, Moldova and Spain.
- Documentary: Iraq, Lebanon, Uganda, etc.
Preserve history, traditions, and events, offering insight into past decisions
- Science & Technology
Tops in : Germany and Russia Reflects high tech engagement regions.
- Music
Surprisingly leads in only Finland Suggests that while music is globally liked, it may not consistently rank in top trending by average views across different regions.
import pandas as pd
import plotly.express as px
UT_Videos_df = pd.read_csv('data/YouTubeTrending.Videos.csv')
top_categories = UT_Videos_df.groupby(['region_name', 'category_name'])['view_count'].mean().reset_index()
top_categories = top_categories.loc[top_categories.groupby('region_name')['view_count'].idxmax()]
top_categories = top_categories.sort_values(by='region_name')
fig = px.bar(top_categories, x='region_name', y='view_count', color='category_name',
title='Top Categories by Region (Average View Count)',
labels={'region_name': 'Region', 'view_count': 'Average View Count', 'category_name':'Category'},
color_discrete_sequence=px.colors.qualitative.Plotly)
fig.update_layout(
plot_bgcolor='#FFFFFF',
paper_bgcolor='#FFFFFF',
font=dict(color='#282828'),
legend_title_text='Category'
)
fig.show()
categories = top_categories.groupby(['category_name'])['region_name'].unique().reset_index()
categories['region_name'] = categories['region_name'].apply(lambda x: ', '.join(x))
categories = categories.sort_values(by='category_name')
categories.columns = ['Category', 'Countries']
print(categories.to_string(index=False))
Category Countries
Action/Adventure Algeria, Argentina, Bahrain, Brazil, El Salvador, Greece, Honduras, Iceland, India, Liechtenstein, Malta, Mexico, Montenegro, Morocco, Panama, Romania, Serbia, Slovakia, Thailand, Tunisia, United Kingdom
Anime/Animation Belarus, Cambodia, Chile, Dominican Republic, Guatemala, Israel, Jamaica, Libya, Luxembourg, Oman, Saudi Arabia, South Africa, Sri Lanka, Tanzania, Ukraine, Venezuela
Autos & Vehicles Czechia, Denmark, France, Japan, Kenya, Nigeria, Senegal, Uruguay
Classics Belgium, Ecuador, Latvia, Lithuania, Qatar, Switzerland, Taiwan
Comedy Nicaragua, Paraguay, Singapore
Documentary Bosnia and Herzegovina, Hong Kong, Iraq, Lebanon, Pakistan, Philippines, Uganda, Yemen
Drama Estonia, Hungary, Indonesia, Ireland, Laos, Moldova, Spain
Education Canada, South Korea, United Arab Emirates
Entertainment Azerbaijan, Egypt, North Macedonia, Peru, Turkey, Zimbabwe
Family Croatia
Film & Animation Bangladesh, Vietnam
Foreign Bolivia, Bulgaria, Colombia
Gaming Ghana, Papua New Guinea, Portugal
Horror Costa Rica
Howto & Style New Zealand
Movies Poland
Music Finland
News & Politics Italy, Kazakhstan, Kuwait, Malaysia, Puerto Rico
Nonprofits & Activism United States
Pets & Animals Cyprus
Science & Technology Germany, Russia
Short Movies Netherlands
Shorts Georgia, Sweden
Shows Australia
Sports Austria, Norway
Thriller Nepal
Videoblogging Jordan, Slovenia
Top Video Categories by Region - Likes Count¶
Globally dominant Categories
- Action/Adventure
Most liked in 23 countries: Argentina, Greece, Croatia, South Korea, United States, Thailand, Malaysia, Mexico, Colombia, Kazakhstan This category is a consistent leader in both views and likes, indicating strong entertainment appeal.
- Anime/Animation
Tops likes in 16 countries: Cambodia, France, El Salvador, South Africa, Iceland, Saudi Arabia, Ukraine Strong presence in animation-loving cultures across Asia, Europe, and Africa.
- Autos & Vehicles
Countries: Brazil, Canada, Costa Rica, Czechia, Denmark, Ghana, Japan, Nigeria, Senegal, Uruguay Indicates strong interest in car culture in certain regions.
Regional Popular
- Entertainment
Azerbaijan, Germany, Jamaica, Peru, Turkey, Zimbabwe Consistent favorite, particularly in Middle East and South America
- Gaming
Most liked in: Dominican Republic, Libya, Portugal
- Education
United Arab Emirates, Vietnam
- Documentary
Bosnia and Herzegovina, Georgia, Hong Kong, Iraq, Kenya, Lebanon, Morocco, Pakistan, Serbia, Uganda
- Drama
Ecuador, Indonesia, Laos, Moldova, Nepal, Venezuela
import pandas as pd
import plotly.express as px
top_categories2 = UT_Videos_df.groupby(['region_name', 'category_name'])['like_count'].mean().reset_index()
top_categories2 = top_categories2.loc[top_categories2.groupby('region_name')['like_count'].idxmax()]
top_categories2 = top_categories2.sort_values(by='region_name')
fig = px.bar(top_categories2, x='region_name', y='like_count', color='category_name',
title='Top Categories by Region (Average Likes Count)',
labels={'region_name': 'Region', 'like_count': 'Average Likes Count','category_name':'Category'},
color_discrete_sequence=px.colors.qualitative.Plotly)
fig.update_layout(
plot_bgcolor='#FFFFFF',
paper_bgcolor='#FFFFFF',
font=dict(color='#282828'),
legend_title_text='Category'
)
fig.show()
categories2 = top_categories2.groupby(['category_name'])['region_name'].unique().reset_index()
categories2['region_name'] = categories2['region_name'].apply(lambda x: ', '.join(x))
categories2 = categories2.sort_values(by='category_name')
categories2.columns = ['Category', 'Countries']
print(categories2.to_string(index=False))
Category Countries
Action/Adventure Argentina, Belgium, Bolivia, Colombia, Croatia, Greece, Israel, Kazakhstan, Liechtenstein, Malaysia, Malta, Mexico, Montenegro, Oman, Philippines, Romania, Slovakia, Slovenia, South Korea, Thailand, United States
Anime/Animation Belarus, Cambodia, El Salvador, France, Honduras, Iceland, Luxembourg, New Zealand, Papua New Guinea, Saudi Arabia, South Africa, Sri Lanka, Tanzania, Ukraine
Autos & Vehicles Brazil, Canada, Costa Rica, Czechia, Denmark, Ghana, Japan, Nigeria, Senegal, Uruguay
Classics Algeria, Bahrain, Egypt, India, Lithuania, North Macedonia, Qatar, Switzerland, Taiwan
Comedy Latvia, Nicaragua, Paraguay, Singapore
Documentary Bosnia and Herzegovina, Georgia, Hong Kong, Iraq, Kenya, Lebanon, Morocco, Pakistan, Serbia, Uganda
Drama Ecuador, Indonesia, Laos, Moldova, Nepal, Venezuela
Education United Arab Emirates, Vietnam
Entertainment Azerbaijan, Germany, Jamaica, Peru, Turkey, Zimbabwe
Family Ireland, Panama
Film & Animation Bangladesh, Chile, Guatemala, United Kingdom
Foreign Bulgaria, Russia, Spain
Gaming Dominican Republic, Libya, Portugal
Horror Estonia
Movies Poland
Music Finland
News & Politics Italy, Kuwait, Puerto Rico
Pets & Animals Cyprus, Sweden
Sci-Fi/Fantasy Hungary, Yemen
Short Movies Netherlands
Shows Australia, Tunisia
Sports Austria, Norway
Videoblogging Jordan
Conclusion Q1¶
Action/Adventure is the most popular category worldwide, with high views and likes. Anime/Animation and Autos & Vehicles engage audiences across multiple regions, while Entertainment is especially popular in the Middle East and Latin America. Regional preferences matter like Documentary, Drama, and Education succeed in certain areas because of local interests and values.
Q2: How Does the Category of a YouTube Video Correlate with the Number of Likes It Receives?¶
Objective: Assessing how a video's category affects likes, revealing its impact on audience engagement.
ANOVA (Analysis of Variance)¶
- Null Hypothesis (H₀): All categories have the same average number of likes.
- Alternative Hypothesis (H₁): At least one category has a significantly different average number of likes.
from scipy.stats import f_oneway
groups = [group['like_count'].dropna() for name, group in UT_Videos_df.groupby('category_name')]
f_statistic, p_value = f_oneway(*groups)
print("F-statistic:", f_statistic)
print("p-value:", p_value)
F-statistic: 1.3747463013472645 p-value: 0.08310450457835931
Due to the p-value result of 0.0831, which is higher than 0.005, I failed to reject the null hypothesis. This means that , based on the current dataset, there is no strong statistical evidence that the number of likes differs accross video categories.
Correlation Analysis¶
In this section, we explore the relationship between video categories (a categorical variable) and like count (a numerical metric) using correlation techniques adapted for categorical data. Pearson’s correlation needs numbers, but category_name is categorical, so we can't use it directly. Instead, we apply encoding to convert categories into a format for correlation analysis.
Method: One-Hot Encoding + Correlation¶
- One-Hot Encoding:creates binary columns (1=true and 0=false)
- Pearson Correlation: measures linear association
encoded = pd.get_dummies(UT_Videos_df['category_name'])
encoded['like_count'] = UT_Videos_df['like_count']
correlation_matrix = encoded.corr()
category_like_corr = correlation_matrix['like_count'].sort_values(ascending=False)
print(category_like_corr)
like_count 1.000000 Gaming 0.013508 Comedy 0.011744 Shorts 0.011497 Music 0.010646 News & Politics 0.008911 Family 0.008722 Thriller 0.006777 Documentary 0.006641 Howto & Style 0.004766 Action/Adventure 0.002123 Classics 0.001257 Pets & Animals 0.000799 Anime/Animation 0.000684 Movies 0.000498 People & Blogs -0.000142 Entertainment -0.000329 Foreign -0.000352 Drama -0.001002 Trailers -0.001479 Science & Technology -0.001690 Shows -0.001786 Sci-Fi/Fantasy -0.002390 Videoblogging -0.002659 Nonprofits & Activism -0.002932 Short Movies -0.004856 Horror -0.005372 Education -0.005824 Autos & Vehicles -0.008771 Sports -0.010763 Travel & Events -0.011405 Film & Animation -0.013108 Name: like_count, dtype: float64
Gaming, Comedy, Shorts, Music, and News & Politics are positively associated with higher like counts, although the correlation strength is weak (all values are below 0.02). This indicates that category alone has limited predictive power for the number of likes. However, these categories are slightly more likely to receive higher engagement in terms of likes compared to others.
Conclusion Q2¶
Based on the statistical analysis conducted, there is no strong evidence to suggest that the number of likes significantly differs across video categories. The ANOVA test returned a p-value of 0.0831, which is higher than the 0.05 significance threshold. As a result, we fail to reject the null hypothesis, indicating that any observed differences in average likes across categories are not statistically significant within the current dataset.
However, the correlation analysis revealed that Gaming, Comedy, Shorts, Music, and News & Politics categories are positively associated with higher like counts. While the correlation values are weak (all below 0.02), these categories appear slightly more likely to receive higher user engagement.
This suggests that category alone has limited predictive power for like count, and that other factors such as content title, description, or others play a more influential role in determining user engagement.
Q3: Can we predict the number of likes or views a YouTube video will receive based on its description, category, or region?¶
Objective: To build a machine learning model that predict metrics (likes and views) using description, category_name and region_name
I decided to Random Forest because I'm working with categorical variables and text features.
- The data has to be prepare in case I still have some empty values and to also invlude stop_words as english
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
X_description = vectorizer.fit_transform(UT_Videos_df['description'].fillna(''))
- Convert categorical variables into dummies variables and then I will combine them
from scipy.sparse import hstack
X_category = pd.get_dummies(UT_Videos_df['category_name'])
X_region = pd.get_dummies(UT_Videos_df['region_name'])
X_combined = hstack([X_description, X_category.values, X_region.values])
- Log transformation to compress outliers
import numpy as np
y_likes = np.log1p(UT_Videos_df['like_count'])
y_views = np.log1p(UT_Videos_df['view_count'])
Import the necessary packages, define the test set percentage, initialize the model, fit it to the training data, generate predictions, and evaluate the model using R² score, mean absolute error (MAE), and root mean squared error (RMSE).
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(X_combined, y_likes, test_size=0.2, random_state=42)
model = RandomForestRegressor(random_state=42)
model.fit(X_train, y_train)
RandomForestRegressor(random_state=42)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestRegressor(random_state=42)
y_pred = model.predict(X_test)
print("R² Score:", r2_score(y_test, y_pred))
print("MAE:", mean_absolute_error(y_test, y_pred))
print("RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
R² Score: 0.7794941556611203 MAE: 0.5079956584117199 RMSE: 1.3426810054487652
Conclusion Q3¶
R² = 0.78 demonstrates strong predictive power, showing that it effectively captures 78% of engagement patterns based on description, category, and region.
Q4: How do emotions and sentiments in video descriptions relate to engagement metrics, such as likes, comments, and views? Specifically, do descriptions with emotional tones like joy and happiness result in a higher number of likes and comments?¶
Objective: To demostrative if positive emotions (e.g., joy, excitement) correlate with higher user engagement, especially in the form of likes and comments.
Description column will be cleanned from URLs, converted to lower case and punctuation will be removed
UT_Videos_df['description_cleaned'] = UT_Videos_df['description'].fillna('').str.lower()
Sentimental Analysis¶
The description will get a polarity score as Positive (>0.1), Neutral(-0.1 - 0.1) or Negative (<-0.1).
from textblob import TextBlob
def label_sentiment(score):
if score > 0.1:
return 'Positive'
elif score < -0.1:
return 'Negative'
else:
return 'Neutral'
UT_Videos_df['sentiment_score'] = UT_Videos_df['description_cleaned'].apply(lambda x: TextBlob(x).sentiment.polarity)
UT_Videos_df['sentiment_label'] = UT_Videos_df['sentiment_score'].apply(label_sentiment)
q4_summary = UT_Videos_df.groupby('sentiment_label')[['like_count', 'comment_count', 'view_count']].mean().reset_index()
print(q4_summary)
sentiment_label like_count comment_count view_count 0 Negative 152009.142705 5259.974441 4.694525e+06 1 Neutral 468165.640967 3679.454178 1.465515e+07 2 Positive 790457.261765 5380.212386 2.137299e+07
import pandas as pd
import matplotlib.pyplot as plt
fig, axs = plt.subplots(3, 1, figsize=(10, 12), constrained_layout=True)
axs[0].bar(q4_summary['sentiment_label'], q4_summary['like_count'], color='skyblue')
axs[0].set_title('Average Likes by Sentiment')
axs[0].set_ylabel('Average Likes')
axs[1].bar(q4_summary['sentiment_label'], q4_summary['comment_count'], color='lightgreen')
axs[1].set_title('Average Comments by Sentiment')
axs[1].set_ylabel('Average Comments')
axs[2].bar(q4_summary['sentiment_label'], q4_summary['view_count'], color='salmon')
axs[2].set_title('Average Views by Sentiment')
axs[2].set_ylabel('Average Views')
plt.suptitle('Engagement Metrics by Sentiment in Video Descriptions', fontsize=16)
plt.show()
Conclusion Q4¶
This results provides a clear evidence that Videos with positive emotional tones have the highest average likes, comments, and views. This supports the hypothesis "Positive emotions in descriptions are associated with higher engagement".
Conclusion¶
This study investigated the patterns and drivers of engagement in YouTube’s top 50 trending videos across multiple countries, focusing on category popularity, engagement metrics, predictive modeling, and emotional tone in video descriptions.
Regional Preferences in Video Categories: the analysis confirmed that video category popularity significantly varies by region. Action/Adventure emerged as the most popular category globally, while Anime/Animation and Autos & Vehicles were also widely favored across diverse regions.
Correlation Between Category and Likes: although descriptive statistics suggested higher engagement for categories like Gaming, Comedy, Shorts, and Music, the ANOVA test yielded a p-value of 0.0831, indicating that differences in like counts across categories are not statistically significant.
Predicting Likes and Views: Machine learning models, particularly Random Forest Regression, demonstrated strong predictive power, achieving an R² score of 0.78. This means that 78% of the variation in likes and views can be explained by description content, region, and category.
Sentiment and Emotional Tone in Descriptions: Sentiment analysis revealed that videos with positive emotional tone (e.g., joy, excitement) had the highest average likes, comments, and views.
Across all research questions, the findings highlight the complex and multi-dimensional nature of user engagement on YouTube. While category and region shape content visibility and interest, emotional tone and metadata such as video descriptions contribute meaningfully to how audiences respond.
These insights help shape content, marketing, and platform algorithms, making media more targeted, culturally relevant, and emotionally engaging.